
fix: address unresolved review comments from PyPDF File Processor PR#4743#5173

Merged
cdoern merged 47 commits into llamastack:main from RobuRishabh:RHAIENG-1823-Address-Unresolved-Reviews
Apr 9, 2026

Conversation

@RobuRishabh
Contributor

What does this PR do?

Addresses remaining unresolved review comments from PR #4743 (PyPDF File Processor integration) to ensure the file processing pipeline is consistent, correctly typed, and aligned with API expectations.

Key changes:

  • Remove legacy chunking fallback: Eliminate the _legacy_chunk_file method and all fallback paths from OpenAIVectorStoreMixin. The system now raises a clear RuntimeError if file_processor_api is not configured, instead of silently degrading to legacy inline parsing.
  • Wire file_processor_api through all vector_io providers: Add Api.file_processors to optional_api_dependencies in the registry, pass it through all 12 factory functions, and accept/forward it in all 9 adapter constructors.
  • Make files_api required in PyPDF constructors: Remove the default None from both PyPDFFileProcessorAdapter and PyPDFFileProcessor, and use deps[Api.files] (bracket access) in the factory to fail fast if somehow missing.
  • Bounded file reads: Implement chunked reading for direct uploads to cap memory usage, and add a size check on the file_id retrieval path against max_file_size_bytes.
  • Clear error for missing file_id: Wrap openai_retrieve_file in a try/except that surfaces a ValueError("File with id '...' not found"), with a new test covering this case.
  • Conditional text cleaning: Make the .strip() whitespace-only page filter conditional on the clean_text config setting.
  • Dead code cleanup: Remove unused file_processor_api field from VectorStoreWithIndex and the now-unused make_overlapped_chunks import from the mixin.
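The fail-fast behavior described above can be sketched as follows. This is a hypothetical illustration, not the actual llama-stack code: the class shape, method names, and error message are assumptions.

```python
class OpenAIVectorStoreMixin:
    """Sketch of the fail-fast behavior: no silent fallback to legacy
    inline parsing when the file processor API is absent."""

    def __init__(self, file_processor_api=None):
        self.file_processor_api = file_processor_api

    def process_file(self, file_bytes, filename):
        if self.file_processor_api is None:
            # Raise loudly instead of degrading to a legacy chunking path.
            raise RuntimeError(
                "file_processor_api is not configured; register a "
                "file_processors provider (e.g. inline::pypdf)"
            )
        return self.file_processor_api.process(file_bytes, filename)
```

The same idea applies in the factory: bracket access like `deps[Api.files]` raises `KeyError` immediately if the dependency is missing, rather than passing `None` downstream.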

Closes #4743

Test Plan

Automated tests

1. Unit tests (mixin + vector_io)

KMP_DUPLICATE_LIB_OK=TRUE uv run --group unit pytest -sv tests/unit/

All test_contextual_retrieval.py tests (16) and all test_vector_store_config_registration.py tests pass — these exercise the refactored OpenAIVectorStoreMixin.

2. PyPDF file processor tests (20/20 pass)

uv run --group test pytest -sv tests/integration/file_processors/test_pypdf_processor.py

3. Full integration suite (replay mode)

uv run --group test pytest -sv tests/integration/ --stack-config=starter

Result: 4 failed, 54 passed, 639 skipped, 1 xfailed

All 4 failures are pre-existing and unrelated:

  • test_safety_with_image — Pydantic schema mismatch (type: 'image' vs 'image_url')
  • test_starter_distribution_config_loads_and_resolves / test_postgres_demo_distribution_config_loads — relative path FileNotFoundError
  • test_mcp_tools_list_with_schemas — no local MCP server (Connection refused)

No regressions in vector_io, file_search, or ingestion workflows.

Manual E2E verification (with starter distro)

export OLLAMA_URL="http://localhost:11434/v1"
llama stack run starter

1. Verify route is registered:

curl -sS http://localhost:8321/v1/inspect/routes \
  | jq -r '.data[] | select(.route|test("file-processors";"i"))'

Expected:

{
  "route": "/v1alpha/file-processors/process",
  "method": "POST",
  "provider_types": [
    "inline::pypdf"
  ]
}

2. Verify OpenAPI contains the endpoint:

curl -sS http://localhost:8321/openapi.json | rg -n "/v1alpha/file-processors/process|file-processors" -i

3. Direct file upload:

curl -sS -X POST http://localhost:8321/v1alpha/file-processors/process \
  -F "file=@/path/to/sample.pdf" | jq

Expected: chunks response with metadata.processor = "pypdf".

4. Via file_id:

FILE_ID="$(curl -sS -X POST http://localhost:8321/v1/files \
  -F "file=@/path/to/sample.pdf" -F "purpose=assistants" | jq -r .id)"

curl -sS -X POST http://localhost:8321/v1alpha/file-processors/process \
  -F "file_id=${FILE_ID}" | jq

Expected: chunks response with metadata.processor = "pypdf" and file_id in chunk metadata.

@meta-cla

meta-cla bot commented Mar 17, 2026

Hi @RobuRishabh!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@meta-cla

meta-cla bot commented Mar 17, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Mar 17, 2026
…lamastack#4743

- Remove legacy chunking fallback and _legacy_chunk_file from vector store mixin;
  raise RuntimeError if FileProcessor API is not configured
- Wire file_processor_api through all vector_io providers (registry, factories,
  adapter constructors)
- Make files_api required in PyPDF adapter and processor constructors
- Implement chunked file reading (64KB) for direct uploads to cap memory usage
- Add size check on file_id retrieval path against max_file_size_bytes
- Wrap openai_retrieve_file in try/except to surface clear ValueError for
  missing file_id, with test coverage
- Make .strip() page filter conditional on clean_text config
- Remove unused file_processor_api field from VectorStoreWithIndex
- Clean up dead imports (make_overlapped_chunks) from mixin
- fixed linters, formats using pre-commit checks
- fixed pypdf to handle .txt files

Signed-off-by: roburishabh <roburishabh@outlook.com>
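The 64 KB bounded read described in this commit could look roughly like the sketch below; `read_bounded`, the constant name, and the error message are assumptions for illustration, not the actual implementation.

```python
import io

CHUNK_SIZE = 64 * 1024  # 64 KiB, the chunk size mentioned above


def read_bounded(stream, max_file_size_bytes):
    """Read a stream in fixed-size chunks, aborting as soon as the size cap
    is exceeded instead of buffering the whole upload with one read()."""
    buf = bytearray()
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            return bytes(buf)
        buf.extend(chunk)
        if len(buf) > max_file_size_bytes:
            raise ValueError(
                f"File exceeds maximum size of {max_file_size_bytes} bytes"
            )
```

On the file_id path, the same max_file_size_bytes check can be applied to the retrieved file's reported size before its content is loaded.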
@RobuRishabh RobuRishabh force-pushed the RHAIENG-1823-Address-Unresolved-Reviews branch from 43b1105 to 1eb5352 on March 17, 2026 17:36
@RobuRishabh RobuRishabh requested a review from cdoern March 17, 2026 19:52
Collaborator

@cdoern cdoern left a comment


I think you are missing additions to VectorIORouter, so all the tests are failing because the args to the router are mismatched.

Signed-off-by: roburishabh <roburishabh@outlook.com>
  - Add MIME type parsing safety check to prevent IndexError
  - Document chunked file reading approach and rationale
  - Make file_processors a hard dependency for all vector_io providers
  - Add unit test for missing file_processor_api error handling

Signed-off-by: roburishabh <roburishabh@outlook.com>
@RobuRishabh RobuRishabh requested a review from cdoern March 19, 2026 17:08
@mergify
Contributor

mergify bot commented Mar 25, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @RobuRishabh please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 25, 2026
Remove duplicate legacy chunking code that was incorrectly merged
alongside the new FileProcessor API path, and fix incomplete
RuntimeError syntax. Also remove unused make_overlapped_chunks import

Signed-off-by: roburishabh <roburishabh@outlook.com>
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Recordings committed successfully

Recordings from the integration tests have been committed to this PR.

View commit workflow

@cdoern
Collaborator

cdoern commented Apr 7, 2026

The replay failures are not related to this PR but they block it from merging. All file_search tests that need to be re-recorded fail during replay because request hashes don't match the recordings.

From the server log:

RuntimeError: Recording not found for request hash: bfa38c097b0589ca...
Model: gpt-4o | Request: POST https://api.openai.com/v1/v1/chat/completions

The file_search tests generate new vector store IDs, file IDs, and chunk scores on each run. These end up in the chat completion request body, changing the hash so the replay system can't find the matching recording. The API recorder needs to normalize these non-deterministic fields before hashing for this PR to pass CI.
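One plausible shape for that normalization — scrub UUID-formatted IDs from the request body before hashing, so re-generated vector store and file IDs no longer change the recording key — is sketched below. The function name and placeholder are assumptions; the recorder's real implementation may differ (and would also need to handle non-deterministic chunk scores).

```python
import hashlib
import re

# 8-4-4-4-12 hex digits: the UUID shape used by vector store / document IDs.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def normalized_request_hash(body: str) -> str:
    """Replace non-deterministic IDs with a fixed placeholder so two runs of
    the same test hash to the same recording key."""
    canonical = UUID_RE.sub("<uuid>", body)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this in place, two requests that differ only in generated IDs map to the same recording.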

@cdoern
Collaborator

cdoern commented Apr 7, 2026

@franciscojavierarceo @alinaryan @RobuRishabh ^^

@cdoern
Collaborator

cdoern commented Apr 7, 2026

that is my best guess at least.

RobuRishabh and others added 15 commits April 7, 2026 15:21
… hashing

Signed-off-by: roburishabh <roburishabh@outlook.com>
Signed-off-by: roburishabh <roburishabh@outlook.com>
…l results

Signed-off-by: roburishabh <roburishabh@outlook.com>
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…ool results

The patterns match the UUID format (8-4-4-4-12 hex digits), since the document IDs
in the actual tool results are UUIDs, not file IDs.

Updated OpenAI test recordings with new normalization (16 recordings).
Deleted Azure test recordings (85 recordings) - these will be regenerated
by CI with the new normalization when Azure integration tests run.

Verified locally with OpenAI setup in replay mode - 16/16 tests pass.

Signed-off-by: roburishabh <roburishabh@outlook.com>
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Update test counts in provider matrix to reflect new Azure file_search recordings.

Signed-off-by: roburishabh <roburishabh@outlook.com>
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@RobuRishabh RobuRishabh requested a review from cdoern April 9, 2026 13:56
Collaborator

@cdoern cdoern left a comment


I really feel like something must be wrong here since we are adding 157,000 lines, most of which are removing, moving, and re-adding recordings. I don't want to stall this further, but in the future I really encourage PRs with fewer commits that are tested locally before opening the PR, if possible. Thank you!

@franciscojavierarceo
Collaborator

I really feel like something must be wrong here since we are adding 157,000 lines..... most of which are removing, moving, and re-adding recordings. I dont want to stall this further but I really encourage PRs with less commits and that are tested locally before opening the PR in the future if possible. Thank you!

are you saying this wasn't tested locally?

@RobuRishabh I believe you did test this locally, right?

@cdoern
Collaborator

cdoern commented Apr 9, 2026

@franciscojavierarceo I am saying that there were a dozen or so commits running pre-commit and re-pushing for integration test failures, both of which could be run locally with pre-commit or integration-tests.sh --inference-mode replay. All of the commits make it hard to figure out whether we really need re-recording using the action or whether we are unnecessarily re-recording.

@cdoern
Collaborator

cdoern commented Apr 9, 2026

because most of the recording changes here are not functional: e.g. updates to the OpenAI SDK version

@RobuRishabh
Contributor Author

I really feel like something must be wrong here since we are adding 157,000 lines..... most of which are removing, moving, and re-adding recordings. I dont want to stall this further but I really encourage PRs with less commits and that are tested locally before opening the PR in the future if possible. Thank you!

are you saying this wasn't tested locally?

@RobuRishabh I believe you did test this locally, right?

Yes, I did test this locally. I ran the responses integration tests with --inference-mode record initially, then, to minimize the diff, re-ran with record-if-missing against the ci-tests server config. All relevant tests passed; the only failures were pre-existing web search environment issues unrelated to my changes. However, since I didn't have Azure and OpenAI API keys, I wasn't initially able to figure out why the recording-dependent tests were failing. When you mentioned pulling the recordings from the branch and comparing them locally, I tried that, it worked, and the new recordings were generated successfully. The rest of the tests (unit tests, non-recording-dependent integration tests) were all run locally.

@cdoern cdoern added this pull request to the merge queue Apr 9, 2026
Merged via the queue into llamastack:main with commit 5d63477 Apr 9, 2026
67 checks passed
